Police line image

Contents

Police line image

Presentation. Intro

What problem is solving here?

  • Community policing progressives want and try to advance policing [1]. But less than half of the major police departments participating in the initiative now. We want to show at least the limited pool of insights stakeholders of all interests may gather from publishing their data.

To contents


Why is it important?

  • To increase the transparency of public safety-related policy-making & budget planning via data analysis.
    • To make a public audit of FBI reports.
  • To utilize important public data & to inform tourists and locals about the safety of a city part and outcomes of taxation.
  • To support initiative of progressive police departments via 'citizen'-data-analysis.

To contents


What data are used here?

Source: contribution of Detroit, MI to “Police Data Initiative”

Nature: reports from police information management system

Sample:

Pipeline

To contents


What processing and analytics pipelines are used to solve a problem?

Pipeline

  1. Preprocessing:
    1. Raw data cleaning and feature engineering. Python is used
    2. Load data to MongoDB in cloud
  2. Analytics and visualizations:
    1. Access the data from everywhere. Python is used
    2. Analitics. Python used
    3. Visualisation. Python, kepler.gl for 3d are used
  3. Presentations of trends and insight
    1. Jupyter notebook is used
    2. Power Point is used
    3. Thematic web site is used

To contents


What technologies are used to solve a problem?

Technologies used

  • Plus, Datashader & Holoviews (mostly, matplotlib backend)

To contents


Aux. scripts

Import packages, then load data untraditional way, via connection to DB or traditional way, via reading operation from local disk.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datashader as ds
import datashader.transfer_functions as tf
import holoviews as hv
from holoviews.operation.datashader import datashade
from holoviews import opts, dim
hv.extension('matplotlib')
from colorcet import fire
datashade.cmap = fire[50:]

import src.db_connector
PATH_TO_PROC_DATA = 'data\\interim\\'

Load or read data

In [2]:
%%time
LOAD_REQ = False
if LOAD_REQ == True:
    df = pd.DataFrame(src.db_connector.get_many_items('crimes', 'detroit')) # return iterator, so need container
else:
    df = pd.read_csv(PATH_TO_PROC_DATA + 'RMS_Crime_Incidents_modified.csv', index_col=0, low_memory=False)
# extra preprocess for categories       
df['Crime Against'] = df['Crime Against'].astype('category')
df['year'] = df['year'].astype('category')
df['Crime Against codes'] = df['Crime Against'].cat.codes
Wall time: 1.89 s

Presentation. Insights

General insights:

  • Numer of reported crimes:

    • The median number of offenses registered per year is equals to $81 400$;
    • The median number of offenses registered per hour per year is equals $2175$
  • Zip codes of reported offenses

    • $50%$ of Detroit's zip codes have more than $7740$ offenses registered during the whole period,
    • $15%$ of Detroit's zip codes have less than $2200$ offenses registered during the whole period
  • Reported crimes by day and by week distributions

    • The distribution of offenses registered per day of the week of the whole period:
      • Hour with the most number is 4.00 AM (Detroit time).
    • The hour distribution of offenses registered per the whole period:
      • The weekday with the most number of crimes reported is Friday (Detroit time).
Per hour Per weekday
By hour By weekday

SEX related:

  • Prostitution is active on Wednesday, Friday and Tuesday (in order of increasing), in the evening at night, while another week is Criminal Sexual Conduct 2nd and 4th degree. There are up to 3 most active prostitution-related offenses addresses.
    • Wednesday is a day of Oral/Anal Criminal Sexual Conducts and Penis/Vagina Criminal Sexual Conducts day is Saturday.
      • East part is strangely related to sex-related offenses.

General:

  • Car thefts are same active over a week

To reproduce the numbers:

  • df.groupby(['year']).crime_id.count().quantile(0.5)
  • df.groupby(['year', 'incident_timestamp_dt_hour']).crime_id.count().median()

  • df.groupby('incident_timestamp_dt_hour').crime_id.count().plot.bar()

  • df.groupby('incident_timestamp_dt_day_of_week').crime_id.count().plot.bar()
  • this plots are upper at a file
  • df[(df.offense_description.str.contains('PROSTITUT'))].groupby('incident_timestamp_dt_day_of_week').crime_id.count().plot.bar()
  • df[df.offense_category.str.contains('SEX OFFENSES')].groupby('incident_timestamp_dt_day_of_week').crime_id.count().plot.bar()
  • df[(df.offense_description.str.contains('PROSTITUT'))].groupby(['X', 'Y']).count().crime_id.sort_values()
  • `df[(df.offense_description.str.contains('PROSTITUT'))].groupby(['address']).count().crime_id.sort_values()

Aux. visualisations

Let's plot some auxilary statistics plots, which will be applied to general insights.

Some interesting insights:

SEX related:

  • Prostitution is active on Wednesday, Friday and Tuesday (in order of increasing), in the evening at night, while another week is Criminal Sexual Conduct 2nd and 4th degree. There are up to 3 most active prostitution related offences adressed.
    • Wednesday is a day of Oral/Anal Criminal Sexual Conducts and Penis/Vagina Criminal Sexual Conducts day is Saturday.
    • East part is strangely related to sex-related offenses.

General:

  • Car thefts are active the whole week

df[(df.offense_description.str.contains('PROSTITUT'))].groupby('incident_timestamp_dt_day_of_week').crime_id.count().plot.bar() df[df.offense_category.str.contains('SEX OFFENSES')].groupby('incident_timestamp_dt_day_of_week').crime_id.count().plot.bar() df[(df.offense_description.str.contains('PROSTITUT'))].groupby(['X', 'Y']).count().crime_id.sort_values() df[(df.offense_description.str.contains('PROSTITUT'))].groupby(['address']).count().crime_id.sort_values()

Overall offense number per different time-slices.:

  • Per month of a year
  • Per day of a month
  • Per day of a week
  • Per hour of a day

Per month of a year

In [3]:
df.groupby('incident_timestamp_dt_month').crime_id.count().plot.bar(ylim=(15000,26000), zorder=2,
                                                                         figsize=(10, 5))
plt.xlabel('Ordered number of a month, ordered months')
plt.ylabel('Amount of offenses, per period')
plt.grid(alpha=0.5, axis='y', zorder=-1)
# plt.savefig('overall_per_month_detroit_time.png', dip=400)

Per day of a month

In [4]:
df.groupby('incident_timestamp_dt_day_of_month').crime_id.count().plot.bar(ylim=(2000,12000), zorder=2,
                                                                         figsize=(10, 5))
plt.xlabel('Ordered number of a day, ordered by day of a month')
plt.ylabel('Amount of offenses, per period')
plt.grid(alpha=0.5, axis='y', zorder=-1)
#plt.savefig('overall_per_day_of_month_detroit_time.png', dip=400)

Per day of a week

In [5]:
df.groupby('incident_timestamp_dt_day_of_week').crime_id.count().plot.bar(ylim=(35000,40000), zorder=2,
                                                                         figsize=(10, 5))
plt.xlabel('Ordered number of a day, Monday=0, Sunday=6')
plt.ylabel('Amount of offenses, per period')
plt.grid(alpha=0.5, axis='y', zorder=-1)
# plt.savefig('overall_per_weekday_detroit_time.png', dip=400)

Per hour of a day

In [6]:
df.groupby(['incident_timestamp_dt_hour']).crime_id.count().plot.bar(zorder=2,
                                                                         figsize=(10, 5))
plt.xlabel('Hours, ordered number')
plt.ylabel('Amount of offenses, per hour')
plt.grid(alpha=0.5, axis='y', zorder=-1)
# plt.savefig('overall_per_hour_detroit_time.png', dip=400)

Geo

We have reported crimes and offenses and its' known geo-locations. So, we may combine and synergize it. We will use several approaches:

  • 'Layered maps', where each layer means some category. It allows us to evaluate, e.g. temporal and other categorical trends as they were the separate dimensions within the graphical representation. Whereas the limits of the city stayed unchanged
  • 'Density-evaluation-maps and heat-maps', where the amount of 'heat' or density distribution peaks show us the main location of our interests. They might be as well 2 space--dimensional and require some abstract thinking or 3 space-dimensional and show easer representation

* Also, the 'inner parts' are not parts without offenses and crimes, but two historically non-Detroitian municipalities because of the tax reasons. They don't share statistics with our source, and sometimes even hadn't the police

Police line image

In addition, it required to add and use limits of citi's inner administrative division, as municipalities and districts to gather deeper trends in our statistics.

To contents


The number of reported offenses and crimes occurring every 30 minutes over the whole analyzing period

Let's look at the peak points. Let's look at the geo-distribution. Be familiar with the cities' parts

To see the dynamic version, open the Kepler-based web interface (may spend time depends on the device)


All reported offeses and crimes, per period

Geospatial distribution, gridded, of reported offenses and crimes per year

In [7]:
fig = hv.Scatter3D((df.X, df.Y, df.year),
                   kdims=['X', 'Y'], vdims=['year'])
fig.opts(hv.opts.Scatter3D(azimuth=65, elevation=15, s=0.2, alpha=0.3,
                           fig_size=600, show_legend=True, color='year', cmap='Category20',
                           fontsize=20))
fig.relabel("Offenses and crimes, overall per period, 2016 - curr.")
# plt.savefig(PATH_TO_IMAGES + 'Offenses and crimes, overall per period, 2016-curr.png', dip=600)
Out[7]:

All reported offeses and crimes, per category

Geospatial distribution, gridded, of reported offenses and crimes per crime category

Crimes and offenses mihg by grouped by its meta-category from the National Incident-Based Reporting System:

  • Against person
  • Against society
  • Against property
  • Others, mostly non-crimes
In [8]:
print(*df['Crime Against'].cat.categories, sep=', ')
Another, Person, Property, Society
In [9]:
fig = hv.Scatter3D((df.X, df.Y, df['Crime Against codes']), 
                   kdims=['X', 'Y'], vdims=['Crime Against codes'])
fig.opts(hv.opts.Scatter3D(azimuth=70, elevation=20, s=0.2, alpha=0.3, fig_size=600, show_legend=True, 
                           color='Crime Against codes', cmap='fire',
                           fontsize=20))
fig.relabel("Offenses and crimes, overall per category, per period, 2016 - curr.")
# plt.savefig(PATH_TO_IMAGES + 'Offenses and crimes, overall per category per period, 2016-curr.png', dip=600)
Out[9]:

All reported offeses and crimes, per year per category

In [10]:
print(*df['Crime Against'].cat.categories, sep=', ')
Another, Person, Property, Society
In [11]:
# https://justinbois.github.io/bootcamp/2019/lessons/l34_holoviews.html
# https://github.com/holoviz/holoviews/issues/1794
# https://justinbois.github.io/bootcamp/2019/lessons/l34_holoviews.html
fig1 = hv.Scatter3D(data=df, 
                    kdims=['X', 'Y'], 
                    vdims=['Crime Against codes', 'year']).groupby(['year'])
fig1.opts(hv.opts.Scatter3D(azimuth=70, elevation=20, s=0.2, alpha=0.3,
                            fig_size=600, show_legend=True, cmap='fire',
                            fontsize=20))
fig1.relabel("Offenses and crimes per category per category per year")
Out[11]:

(the part below is under reconstruction)

Densities of reported crimes and offenses by category, per year

Let's see on the densities of our crimes and offenses

We saw this peaks above, but statistical algorithms may prove or dis-prove our views about places with peaks.

In [12]:
%%time
for cat_ag in sorted(df['Crime Against'].unique(), reverse=True)[0:3]:
    try: # becouse the probable oversize of points lets take a sample
        df_to_plot = df[df['Crime Against'] == cat_ag].sample(40000) # normal sample to limit computation resources spend
        sns.jointplot(df_to_plot[df_to_plot['Crime Against'] == cat_ag]['X'], 
                      df_to_plot[df_to_plot['Crime Against'] == cat_ag]['Y'], kind='kde',  
                      figsize=(20, 20))
    except:
        sns.jointplot(df[df['Crime Against'] == cat_ag]['X'], df[df['Crime Against'] == cat_ag]['Y'], kind='kde', 
                      figsize=(20, 20))
    plt.suptitle(cat_ag);
    # plt.savefig(PATH_TO_IMAGES + 'by_category_{}.png'.format(cat_ag), dip=800) 
D:\Anaconda\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
D:\Anaconda\lib\site-packages\seaborn\distributions.py:423: UserWarning: The following kwargs were not used by contour: 'figsize'
  cset = contour_func(xx, yy, z, n_levels, **kwargs)
D:\Anaconda\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
D:\Anaconda\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
D:\Anaconda\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
D:\Anaconda\lib\site-packages\seaborn\distributions.py:423: UserWarning: The following kwargs were not used by contour: 'figsize'
  cset = contour_func(xx, yy, z, n_levels, **kwargs)
D:\Anaconda\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
D:\Anaconda\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
D:\Anaconda\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
D:\Anaconda\lib\site-packages\seaborn\distributions.py:423: UserWarning: The following kwargs were not used by contour: 'figsize'
  cset = contour_func(xx, yy, z, n_levels, **kwargs)
D:\Anaconda\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
D:\Anaconda\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Wall time: 40.5 s

Density evaluation of reported crimes for "Inner cities" of Detroit

Data is not provided, but the problem is known and exists. Let's try to evaluate the crimes and offenses amount inside the inner but non-accountable parts of a city

Historically, both Hamtramck and Highland Park weren't the parts of Detroit, thus they don't relate to the Detroit Police Departmentб its statistics and even its duty, so we haven't got a precise number of offenses from the two regions inside of a Detroit. We understand that these old borders do not stop crimes.

In [13]:
import warnings
warnings.filterwarnings('ignore')
In [14]:
%%time
for indx, year in enumerate(sorted(df.year.unique(), reverse=True)):
    g = sns.jointplot(df[df.year == year]['X'], df[df.year == year]['Y'], kind='kde', figsize=(20, 20)) # aplha=0.005
    plt.suptitle(year);
#     g.savefig(PATH_TO_IMAGES + 'density_evaluation_{}.png'.format(year), dip=600)
plt.show();
Wall time: 2min 20s

What might be improved? (EN)

Analytically:

  • Layer of a map and internal division border for all geo- visual representation. E.g., as Kaggle competition at 2014;
  • Prepare analysis by districts and municipalities. With shape-files of Citi division, we will be able to prepare more insightful, deeper and 'targeted' analytical outputs on the topic, more community-related trends;
  • Advise with external experts or/and professionals about nature of the data, pitfalls and next prospectus tasks and problems;

Technically

  • Auto-ETL of new and actual, 'fresh' data from the primary data source;
  • Legend to 3d maps;
  • Layer of a map and internal division borders. Probably, change the technology used from pandas to geopandas;

Continuous Improvement Culture picture

Что можно и нужно добавить? (RU)

Аналитически:

  • Карту-подложку и внутренние границы на все географические графические представления. Например, как в соревновании на кэгл в 2014 году.
  • Анализ по районам и муниципалитетам. Получив шейп-карты с делением города, мы сможем подготовить более глубинные и таргетированные выводы по теме, более приземленные тренды.
  • Проконсультироваться с внешними экспертами и/или профессионалами о природе данных.

Технически

  • Автовыгрузку свежих данных из первоисточника
  • Легенду к 3Д картам
  • Карта подложка и внутренние границы. Возможно, сменить технологию с pandas на geopandas.

To contents

In [ ]: